Skip to content

NODEJS-681: ControlConnection Concurrent Read and Write on .host and .connection#462

Open
toptobes wants to merge 5 commits into
apache:trunkfrom
toptobes:control-connection-refresh
Open

NODEJS-681: ControlConnection Concurrent Read and Write on .host and .connection#462
toptobes wants to merge 5 commits into
apache:trunkfrom
toptobes:control-connection-refresh

Conversation

@toptobes
Copy link
Copy Markdown

@toptobes toptobes commented May 20, 2026

This PR superceeds Jane He's #430 with her blessing

After some investigation, we were unable to figure out the root cause behind the NPEs, with there being multiple potential avenues where the issue may have originated from, and so we decided to fix the issue at the lowest and simplest level we could–simply adding a stronger concurrency control to _refresh directly via a _refreshInProgress flag

I personally believe the issue stemmed from _setHealthListeners being called multiple times on the same host/connection, causing the listeners to trigger refreshes multiple times for the same event, leading to the NPEs mentioned in the ticket.

However the issue is quite hard to organically reproduce so the theory remains a theory.

Potential trace
  1. _refresh() is called
  2. _refresh() calls _refreshControlConnection()
  3. _refreshControlConnection() fails to borrow a connection so it calls _initializeConnection()
  4. _initializeConnection() calls _setHealthListeners()
  5. _refresh() gets back in control and then also calls _setHealthListeners()

which means that there's the potential of, sequentially:

  1. A new host and connection being set (call them H1 and C1)
  2. Listeners being attached to the H1 and C1
  3. A newer host being set (call it H2)
  4. Listeners being attached to the H2 and C1 without the previous listeners being removed

Comment thread lib/control-connection.js
Comment thread lib/promise-utils.js Outdated
assert.strictEqual(cc.hosts.length, 1);
});

it('should not break when refreshing concurrently', async () => {
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may need a better heuristic for ensuring the refreshes are okay... not sure... I just "borrowed" this from Jane's original PR

Comment thread lib/control-connection.js Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses NODEJS-681 by adding concurrency protection around ControlConnection._refresh() to avoid concurrent refresh executions that can lead to inconsistent .host / .connection state, and adds an integration test intended to exercise concurrent refresh behavior.

Changes:

  • Add _refreshInProgress guard and refactor refresh logic into _unsafeDoRefresh() in ControlConnection.
  • Add an integration test that triggers many _refresh() calls.
  • Make promiseUtils.toBackground() tolerate undefined/null inputs via optional chaining.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
lib/control-connection.js Adds a refresh-in-progress guard and refactors refresh implementation into a separate method.
lib/promise-utils.js Makes toBackground() no-op safely when given a nullish value.
test/integration/short/control-connection-tests.js Adds a concurrency-focused integration test for control connection refresh.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/control-connection.js Outdated
Comment thread lib/control-connection.js Outdated
Comment thread lib/promise-utils.js
Comment thread test/integration/short/control-connection-tests.js
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@SiyaoIsHiding SiyaoIsHiding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, the main reason why it broke in the past, is that this.host and this.connection are a nullable type, (they are supposed to be null in certain circumstainces like refreshing), but function calls like this._setHealthListeners(this.host, this.connection); are not treating them as nullable. Imagine if this is TypeScript, it would complain that _setHealthListeners expects Host, Connection as argument while this call passes Host?, Connection?.
So, aside from the _refreshInProgress flag to make sure only one refresh is happening at one time, I think we should still make sure we are using this.host and this.connection as nullable.
That means this._setHealthListeners(this.host, this.connection); should be

if (this.host && this.connection){
this._setHealthListeners(this.host, this.connection);
}

And other places that access this.host and this.connection should also has null guards, like this.
What do you think?

Comment thread test/integration/short/control-connection-tests.js
Comment thread test/integration/short/control-connection-tests.js
@toptobes
Copy link
Copy Markdown
Author

I think, the main reason why it broke in the past, is that this.host and this.connection are a nullable type, (they are supposed to be null in certain circumstainces like refreshing), but function calls like this._setHealthListeners(this.host, this.connection); are not treating them as nullable. Imagine if this is TypeScript, it would complain that _setHealthListeners expects Host, Connection as argument while this call passes Host?, Connection?. So, aside from the _refreshInProgress flag to make sure only one refresh is happening at one time, I think we should still make sure we are using this.host and this.connection as nullable. That means this._setHealthListeners(this.host, this.connection); should be

if (this.host && this.connection){
this._setHealthListeners(this.host, this.connection);
}

And other places that access this.host and this.connection should also has null guards, like this. What do you think?

As far as I can tell, the nullability is meant to be a transient state and trying to impose these silent null checks may just cause more subtle bugs than it solves? I wouldn't be against explicit null checks in _setHealthListeners as an assertion, but simply ignoring setting the listeners could be problematic

in an ideal world we may want to use an explicit state machine but that seems pretty out of scope for this PR

@SiyaoIsHiding
Copy link
Copy Markdown
Contributor

Why

silent null checks may just cause more subtle bugs

?
I think they can potentially fix some bugs that we didn't discover yet.
But either way, I'm good. I think this PR can already fix the problem that we clearly know of.
If you fix that eslint error, I will give the explicit approval.

@toptobes
Copy link
Copy Markdown
Author

Why

silent null checks may just cause more subtle bugs

? I think they can potentially fix some bugs that we didn't discover yet. But either way, I'm good. I think this PR can already fix the problem that we clearly know of. If you fix that eslint error, I will give the explicit approval.

Just to step back for a second we need to decide if it's even valid to get to the end of _refresh() without the host or the connection being set

I don't think it should be, no?

Meaning if it happens, we're just enabling the bug to propagate silently instead of catching it and yelling to the user that that's happened and needs to be patched (which it really should never happen with this refresh fix or there's a bigger issue)

@toptobes toptobes force-pushed the control-connection-refresh branch from 1ca12ed to 91a7b81 Compare May 21, 2026 00:27
@SiyaoIsHiding
Copy link
Copy Markdown
Contributor

No it shouldn't reach the end of _refresh without this.host set.
You convinced me 👍

@toptobes toptobes requested a review from SiyaoIsHiding May 21, 2026 01:34
@SiyaoIsHiding
Copy link
Copy Markdown
Contributor

This test failure is new and it concerns me.

  1) ControlConnection
       #init()
         should subscribe to SCHEMA_CHANGE events and refresh keyspace information:
     Error: Condition still false after 100 attempts: () => cc.metadata.keyspaces['sample_change_1'].strategyOptions.replication_factor === '2'
      at whilstItem (test/test-helper.js:769:23)
      at Timeout.next [as _onTimeout] (lib/utils.js:1042:5)
      at listOnTimeout (node:internal/timers:585:17)
      at process.processTimers (node:internal/timers:521:7)

After some investigation, I think the problem is that a schema refresh triggered by an event can be permanently lost if the CC's connection is down at that moment and cannot establish new connection within 2 seconds. And this singleton refresh implementation makes it more vulnerable than before.

What might have happened:

In CI, a brief TCP hiccup (or Cassandra-side connection reset) closed the CC's socket during the 100ms debounce window after the ALTER KEYSPACE event:

  1. ALTER KEYSPACE fires → SCHEMA_CHANGE event → EventDebouncer starts 100ms timer
  2. TCP socket closes → socketClose _refresh() acquires _refreshInProgress, sets this.connection = null
  3. Debouncer fires → metadata.refreshKeyspacecc.query()connection === null _waitForReconnection() (2s timeout)
  4. _refresh() fails or does not establish new connection within 2 seconds -> _waitForReconnection rejects
  5. Error is swallowed by toBackground() — the ALTER KEYSPACE schema update is silently dropped
  6. CC eventually reconnects but no new SCHEMA_CHANGE event arrives → keyspace stays at replication_factor=3 → poll times out

In the past, when we accidentally allowed concurrent _refresh() calls, If one concurrent attempt succeeded while another failed, the newConnection(null) event would resolve the pending _waitForReconnection before the error could reject it. The new singleton approach eliminates that accidental rescue path.

Fix

Instead of allowing concurrent refresh, I think we should fix the problem of schema refresh errors being swallowed. For example, in control-connection.js _nodeSchemaChangeHandler

  // Instead of: toBackground(this.handleSchemaChange(event, false))
  this.handleSchemaChange(event, false).catch(() => {
    // CC will reconnect; re-queue this event once it does
    this.once('newConnection', (err) => {
      if (!err) toBackground(this.handleSchemaChange(event, false));
    });
  });

Copy link
Copy Markdown
Contributor

@SiyaoIsHiding SiyaoIsHiding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above

@SiyaoIsHiding
Copy link
Copy Markdown
Contributor

Bret points out that the Java driver's behavior is that every time a new control connection is established, it refreshes schema

https://github.com/apache/cassandra-java-driver/blob/90b09e0e34fd6fdda064f13f6acd0df7268a3dc6/core/src/main/java/com/datastax/oss/driver/internal/core/control/ControlConnection.java#L460-L469

It makes sense to me that Node.js driver can align with Java driver's behavior.
That would be an easier fix, too.

diff --git a/lib/control-connection.js b/lib/control-connection.js
index 5abf8b06..4199281c 100644
--- a/lib/control-connection.js
+++ b/lib/control-connection.js
@@ -415,6 +415,8 @@ class ControlConnection extends events.EventEmitter {
 
       await this.metadata.refreshKeyspacesInternal(false);
       this.metadata.initialized = true;
+    } else if (this.options.isMetadataSyncEnabled) {
+    await this.metadata.refreshKeyspacesInternal(false);
     }
   }

@toptobes
Copy link
Copy Markdown
Author

After some deliberation we think we can just call this.metadata.refreshKeyspaces[Internal?]() within _refreshHosts unconditionally as that function appears to invalidate all such schema caches. Need to triple check if this is the right solution thoguh

@SiyaoIsHiding
Copy link
Copy Markdown
Contributor

Confirming above is correct.
The question is whether refreshKeyspacesInternal is enough to cover any EVENTs potentially lost.
There are 3 kinds of EVENTs

  1. TOPOLOGY_CHANGE, handled by refreshHosts
  2. STATUS_CHANGE, a host UP or DOWN, is handled by host.js's reconnection schedule, independent from ControlConnection
  3. SCHEMA_CHANGE
    3a. refreshKeyspacesInternal redo SELECT * FROM system_schema.keyspaces
    3b. It also removes the cache for udt, table, function, aggregate, and views. They will be lazy-loaded when used for the first time.
    So I think refreshKeyspacesInternal is enough.

Copy link
Copy Markdown
Contributor

@SiyaoIsHiding SiyaoIsHiding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test still fails

1) ControlConnection
       #init()
         should subscribe to SCHEMA_CHANGE events and refresh keyspace information:
     Error: Condition still false after 100 attempts: () => !cc.metadata.keyspaces['sample_change_1']
      at whilstItem (test/test-helper.js:769:23)
      at Timeout.next [as _onTimeout] (lib/utils.js:1042:5)
      at listOnTimeout (node:internal/timers:605:17)
      at process.processTimers (node:internal/timers:541:7)

I think it's possible that

  1. When the driver is selecting keyspace sample_change_1 info, that keyspace still existed in the DB, so DB sends the response of keyspace info, but it does not reach the driver yet.
  2. The DROP statement executes.
  3. Driver delete metadata.keyspaces['sample_change_1']
  4. The sample_change_1 keyspace info finally reaches the driver
  5. Driver assigns cc.metadata.keyspaces['sample_change_1']

I need to think about how to fix this..... But I think this path is possible to fail this test.

Comment thread lib/control-connection.js
// Don't attempt to reconnect when the ControlConnection is being shutdown
if (self._isShuttingDown) {
// Don't attempt to reconnect when the ControlConnection is being shutdown
this.log('info', 'The ControlConnection will not be refreshed as the Client is being shutdown');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.log('info', 'The ControlConnection will not be refreshed as the Client is being shutdown');
self.log('info', 'The ControlConnection will not be refreshed as the Client is being shutdown');

Comment thread lib/control-connection.js
this.metadata.initialized = true;
}

this.metadata.initialized = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this will have the side effect that if metadata sync is not enabled, this.metadata.initialized = true; will still run everytime it refreshes, not just once at initialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants